정규화 선형 회귀

정규화(regularized) 선형 회귀 방법은 선형 회귀 계수(weight)에 대한 제약 조건을 추가함으로써 모형이 과도하게 최적화되는 현상, 즉 과최적화를 막는 방법이다. Regularized Method, Penalized Method, Contrained Least Squares 이라고도 불리운다.

모형이 과도하게 최적화되면 모형 계수의 크기도 과도하게 증가하는 경향이 나타난다. 따라서 정규화 방법에서 추가하는 제약 조건은 일반적으로 계수의 크기를 제한하는 방법이다. 일반적으로 다음과 같은 세가지 방법이 사용된다.

Ridge 회귀 모형
Lasso 회귀 모형
Elastic Net 회귀 모형

Ridge 회귀 모형

Ridge 회귀 모형에서는 가중치들의 제곱합(squared sum of weights)을 최소화하는 것을 추가적인 제약 조건으로 한다.

$$ \begin{eqnarray} \text{cost} &=& \sum e_i^2 + \lambda \sum w_i^2 \end{eqnarray} $$

$\lambda$는 기존의 잔차 제곱합과 추가적 제약 조건의 비중을 조절하기 위한 하이퍼 모수(hyper parameter)이다. $\lambda$가 크면 정규화 정도가 커지고 가중치의 값들이 작아진다. $\lambda$가 작아지면 정규화 정도가 작아지며 $\lambda$ 가 0이 되면 일반적인 선형 회귀 모형이 된다.

Lasso 회귀 모형

Lasso(Least Absolute Shrinkage and Selection Operator) 회귀 모형은 가중치의 절대값의 합을 최소화하는 것을 추가적인 제약 조건으로 한다.

$$ \begin{eqnarray} \text{cost} &=& \sum e_i^2 + \lambda \sum | w_i | \end{eqnarray} $$

Elastic Net 회귀 모형

Elastic Net 회귀 모형은 가중치의 절대값의 합과 제곱합을 동시에 제약 조건으로 가지는 모형이다.

$$ \begin{eqnarray} \text{cost} &=& \sum e_i^2 + \lambda_1 \sum | w_i | + \lambda_2 \sum w_i^2 \end{eqnarray} $$

$\lambda_1$, $\lambda_2$ 두 개의 하이퍼 모수를 가진다.

statsmodels의 정규화 회귀 모형

statsmodels 패키지는 OLS 선형 회귀 모형 클래스의 fit_regularized 메서드를 사용하여 Elastic Net 모형 계수를 구할 수 있다.

http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.fit_regularized.html

하이퍼 모수는 다음과 같이 모수 $\text{alpha} $ 와 $\text{L1_wt}$ 로 정의된다.

$$ 0.5 \times \text{RSS}/N + \text{alpha} \times \big( 0.5 \times (1-\text{L1_wt})\sum w_i^2 + \text{L1_wt} \sum |w_i| \big) $$



In [1]:

    
np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = np.cos(1.5 * np.pi * X) + np.random.randn(n_samples) * 0.1

dfX = pd.DataFrame(X, columns=["x"])
dfX = sm.add_constant(dfX)
dfy = pd.DataFrame(y, columns=["y"])
df = pd.concat([dfX, dfy], axis=1)

model = sm.OLS.from_formula("y ~ x + I(x**2) + I(x**3) + I(x**4) + I(x**5) + I(x**6) + I(x**7) + I(x**8) + I(x**9)", data=df)
result1 = model.fit()
result1.params









    Out[1]:





Intercept        0.830090
x               19.456558
I(x ** 2)     -439.348916
I(x ** 3)     3909.347642
I(x ** 4)   -18329.176673
I(x ** 5)    49280.750802
I(x ** 6)   -78994.113652
I(x ** 7)    74769.482237
I(x ** 8)   -38598.716765
I(x ** 9)     8381.793811
dtype: float64



In [2]:

    
def plot_statsmodels(result):
    plt.scatter(X, y)
    xx = np.linspace(0, 1, 1000)
    dfxx = pd.DataFrame(xx, columns=["x"])
    dfxx = sm.add_constant(dfxx)
    plt.plot(xx, result.predict(dfxx).values)
    plt.show()
    
plot_statsmodels(result1)

모수 L1_wt가 0 이면 순수 Ridge 모형이 된다.



In [3]:

    
result2 = model.fit_regularized(alpha=0.01, L1_wt=0)
print(result2.params)
plot_statsmodels(result2)









    



Intercept    0.747697
x           -2.053341
I(x ** 2)   -0.962382
I(x ** 3)   -0.139131
I(x ** 4)    0.297857
I(x ** 5)    0.484305
I(x ** 6)    0.529308
I(x ** 7)    0.501820
I(x ** 8)    0.442001
I(x ** 9)    0.371889
dtype: float64

반대로 모수 L1_wt가 1 이면 순수 Lasso 모형이 된다.



In [4]:

    
result3 = model.fit_regularized(alpha=0.01, L1_wt=1)
print(result3.params)
plot_statsmodels(result3)









    



Intercept    0.872429
x           -2.900711
I(x ** 2)    0.000000
I(x ** 3)    0.000000
I(x ** 4)    0.000000
I(x ** 5)    0.918716
I(x ** 6)    0.627920
I(x ** 7)    0.354740
I(x ** 8)    0.139104
I(x ** 9)    0.000000
dtype: float64

모수 L1_wt가 0과 1 사이이면 Elastic Net 모형이다.



In [5]:

    
result4 = model.fit_regularized(alpha=0.01, L1_wt=0.5)
print(result4.params)
plot_statsmodels(result4)









    



Intercept    0.791629
x           -2.485280
I(x ** 2)   -0.334919
I(x ** 3)    0.000000
I(x ** 4)    0.018108
I(x ** 5)    0.440759
I(x ** 6)    0.549247
I(x ** 7)    0.498811
I(x ** 8)    0.388577
I(x ** 9)    0.268337
dtype: float64

Scikit-Learn의 정규화 회귀 모형

Scikit-Learn 패키지에서는 정규화 회귀 모형을 위한 Ridge, Lasso, ElasticNet 이라는 별도의 클래스를 제공한다. 각 모형에 대한 최적화 목적 함수는 다음과 같다.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

$$ \text{RSS} + \text{alpha} \sum w_i^2 $$

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

$$ 0.5 \times \text{RSS}/N + \text{alpha} \sum |w_i| $$

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

$$ 0.5 \times \text{RSS}/N + 0.5 \times \text{alpha} \times \big(0.5 \times (1-\text{l1_ratio})\sum w_i^2 + \text{l1_ratio} \sum |w_i| \big) $$



In [6]:

    
def plot_sklearn(model):
    plt.scatter(X, y)
    xx = np.linspace(0, 1, 1000)
    plt.plot(xx, model.predict(xx[:, np.newaxis]))
    plt.show()



In [7]:

    
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

poly = PolynomialFeatures(9)
model = make_pipeline(poly, LinearRegression()).fit(X[:, np.newaxis], y)
print(model.steps[1][1].coef_)
plot_sklearn(model)









    



[  0.00000000e+00   1.94565584e+01  -4.39348916e+02   3.90934764e+03
  -1.83291767e+04   4.92807508e+04  -7.89941136e+04   7.47694822e+04
  -3.85987168e+04   8.38179381e+03]



In [8]:

    
model = make_pipeline(poly, Ridge(alpha=0.01)).fit(X[:, np.newaxis], y)
print(model.steps[1][1].coef_)
plot_sklearn(model)









    



[ 0.         -3.57707392 -1.24253699  0.65176155  1.39935489  1.3811883
  0.96006837  0.37104071 -0.25111269 -0.83604486]



In [9]:

    
model = make_pipeline(poly, Lasso(alpha=0.01)).fit(X[:, np.newaxis], y)
print(model.steps[1][1].coef_)
plot_sklearn(model)









    



[ 0.         -3.03534727 -0.          0.          0.          0.
  2.13679879  0.          0.          0.        ]



In [10]:

    
model = make_pipeline(poly, ElasticNet(alpha=0.01, l1_ratio=0.5)).fit(X[:, np.newaxis], y)
print(model.steps[1][1].coef_)
plot_sklearn(model)









    



[ 0.         -2.57943186 -0.26229281  0.          0.03701941  0.38586371
  0.4994734   0.48920509  0.41804678  0.32084092]

정규화 모형의 장점

정규화 모형은 회귀 분석에 사용된 데이터가 달라져도 계수가 크게 달라지지 않도록 한다.



In [11]:

    
X_train = np.c_[.5, 1].T
y_train = [.5, 1]
X_test = np.c_[-1, 3].T
np.random.seed(0)

models = {"LinearRegression": LinearRegression(), 
          "Ridge": Ridge(alpha=0.1)}

for i, (name, model) in enumerate(models.iteritems()):
    ax = plt.subplot(1, 2, i+1)
    for _ in range(10):
        this_X = .1 * np.random.normal(size=(2, 1)) + X_train
        model.fit(this_X, y_train)
        ax.plot(X_test, model.predict(X_test), color='.5')
        ax.scatter(this_X, y_train, s=100, c='.5', marker='o', zorder=10)
        model.fit(X_train, y_train)
        ax.plot(X_test, model.predict(X_test), linewidth=3, color='blue', alpha=0.5)
        ax.scatter(X_train, y_train, s=100, c='r', marker='D', zorder=10)
        plt.title(name)
        ax.set_xlim(-0.5, 2)
        ax.set_ylim(0, 1.6)

Ridge 모형과 Lasso 모형의 차이

Ridge 모형은 가중치 계수를 한꺼번에 축소시키는데 반해 Lasso 모형은 일부 가중치 계수가 먼저 0으로 수렴하는 특성이 있다.



In [12]:

    
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target



In [13]:

    
ridge0 = Ridge(alpha=0).fit(X, y)
p0 = pd.Series(np.hstack([ridge0.intercept_, ridge0.coef_]))
ridge1 = Ridge(alpha=1).fit(X, y)
p1 = pd.Series(np.hstack([ridge1.intercept_, ridge1.coef_]))
ridge2 = Ridge(alpha=2).fit(X, y)
p2 = pd.Series(np.hstack([ridge2.intercept_, ridge2.coef_]))
pd.DataFrame([p0, p1, p2]).T









    Out[13]:






  
    
      
      0
      1
      2
    
  
  
    
      0
      152.133484
      152.133484
      152.133484
    
    
      1
      -10.012198
      29.465746
      33.684368
    
    
      2
      -239.819089
      -83.154885
      -41.040187
    
    
      3
      519.839787
      306.351627
      223.029964
    
    
      4
      324.390428
      201.629434
      152.203371
    
    
      5
      -792.184162
      5.909369
      20.941211
    
    
      6
      476.745838
      -29.515927
      -2.749722
    
    
      7
      101.044570
      -152.040465
      -121.063689
    
    
      8
      177.064176
      117.311715
      103.717329
    
    
      9
      751.279321
      262.944995
      195.099906
    
    
      10
      67.625386
      111.878718
      99.467716



In [14]:

    
lasso0 = Lasso(alpha=0.0001).fit(X, y)
p0 = pd.Series(np.hstack([lasso0.intercept_, lasso0.coef_]))
lasso1 = Lasso(alpha=0.1).fit(X, y)
p1 = pd.Series(np.hstack([lasso1.intercept_, lasso1.coef_]))
lasso2 = Lasso(alpha=10).fit(X, y)
p2 = pd.Series(np.hstack([lasso2.intercept_, lasso2.coef_]))
pd.DataFrame([p0, p1, p2]).T









    Out[14]:






  
    
      
      0
      1
      2
    
  
  
    
      0
      152.133484
      152.133484
      152.133484
    
    
      1
      -9.910816
      -0.000000
      0.000000
    
    
      2
      -239.727144
      -155.362882
      0.000000
    
    
      3
      519.881966
      517.182017
      0.000000
    
    
      4
      324.294322
      275.082351
      0.000000
    
    
      5
      -784.988701
      -52.540269
      0.000000
    
    
      6
      471.210031
      -0.000000
      0.000000
    
    
      7
      97.612539
      -210.159753
      -0.000000
    
    
      8
      175.802361
      0.000000
      0.000000
    
    
      9
      748.684614
      483.914409
      0.000000
    
    
      10
      67.610404
      33.672821
      0.000000

path 메서드

Lasso 와 ElasticNet 클래스는 하이퍼 모수 alpha 값의 변화에 따른 계수의 변화를 자동으로 계산하는 path 메서드를 제공한다. lasso_path(), enet_path() 명령어도 path 메서드와 동일한 기능을 수행한다.



In [15]:

    
lasso = Lasso()
alphas, coefs, _ = lasso.path(X, y, alphas=np.logspace(-6, 1, 8))
df = pd.DataFrame(coefs, columns=alphas)
df









    Out[15]:






  
    
      
      10.0
      1.0
      0.1
      0.01
      0.001
      0.0001
      1e-05
      1e-06
    
  
  
    
      0
      0.0
      0.359018
      10.286331
      33.147694
      8.705495
      -5.452332
      -9.080737
      -9.902637
    
    
      1
      0.0
      0.000000
      0.285976
      -35.245064
      -178.075235
      -230.060677
      -238.307326
      -239.649466
    
    
      2
      0.0
      3.259767
      37.464643
      211.024640
      450.882945
      517.356795
      520.767952
      519.973879
    
    
      3
      0.0
      2.204356
      27.544898
      144.560242
      281.073242
      317.436537
      323.234112
      324.258708
    
    
      4
      0.0
      0.528646
      11.108848
      21.931303
      -44.061899
      -240.402810
      -635.231952
      -772.936350
    
    
      5
      0.0
      0.250935
      8.355882
      0.000000
      -77.939017
      40.281395
      352.230141
      461.473758
    
    
      6
      -0.0
      -1.861363
      -24.120806
      -115.619846
      -188.950843
      -136.775239
      31.891364
      92.542411
    
    
      7
      0.0
      2.114454
      25.505488
      100.658528
      119.797535
      117.921788
      158.412929
      174.749027
    
    
      8
      0.0
      3.105841
      35.465757
      185.326164
      393.706827
      533.930194
      691.527537
      743.979234
    
    
      9
      0.0
      1.769851
      22.894981
      96.257133
      98.943979
      73.934816
      68.648179
      67.740903



In [16]:

    
df.T.plot(logx=True)
plt.show()

	0	1	2
0	152.133484	152.133484	152.133484
1	-10.012198	29.465746	33.684368
2	-239.819089	-83.154885	-41.040187
3	519.839787	306.351627	223.029964
4	324.390428	201.629434	152.203371
5	-792.184162	5.909369	20.941211
6	476.745838	-29.515927	-2.749722
7	101.044570	-152.040465	-121.063689
8	177.064176	117.311715	103.717329
9	751.279321	262.944995	195.099906
10	67.625386	111.878718	99.467716

	10.0	1.0	0.1	0.01	0.001	0.0001	1e-05	1e-06
0	0.0	0.359018	10.286331	33.147694	8.705495	-5.452332	-9.080737	-9.902637
1	0.0	0.000000	0.285976	-35.245064	-178.075235	-230.060677	-238.307326	-239.649466
2	0.0	3.259767	37.464643	211.024640	450.882945	517.356795	520.767952	519.973879
3	0.0	2.204356	27.544898	144.560242	281.073242	317.436537	323.234112	324.258708
4	0.0	0.528646	11.108848	21.931303	-44.061899	-240.402810	-635.231952	-772.936350
5	0.0	0.250935	8.355882	0.000000	-77.939017	40.281395	352.230141	461.473758
6	-0.0	-1.861363	-24.120806	-115.619846	-188.950843	-136.775239	31.891364	92.542411
7	0.0	2.114454	25.505488	100.658528	119.797535	117.921788	158.412929	174.749027
8	0.0	3.105841	35.465757	185.326164	393.706827	533.930194	691.527537	743.979234
9	0.0	1.769851	22.894981	96.257133	98.943979	73.934816	68.648179	67.740903